Backfill related_ingredients via full-text re-fetch (Kombucha, Lotus, OMM12) (#30) #154

Merged: realmarcin merged 1 commit into main from backfill/related-ingredients-fulltext-refetch on Jun 16, 2026

@realmarcin

Applies the (b) re-fetch technique with full text: for stub-cache-limited and abstract-poor files, pulled Europe PMC OA fullTextXML, folded the body into the canonical PMID_<id>.md caches (so snippets stay verifiable), then backfilled. 14 new CHEBI-grounded ingredients across 3 files.

| File | new | from |
| --- | --- | --- |
| Kombucha (PMID_26061774.md enriched) | +4 | glucose, acetic acid, lactic acid, lactose (added to existing sucrose/cellulose → 6 total) |
| Lotus_LjSC3 (PMID_34312531.md created) | 9 | defined artificial-root-exudate medium: sugars, organic acids, amino acids |
| OMM12 (PMID_29051233 + 31998276 enriched) | 1 | complex carbohydrates (bile acid/mucin correctly rejected: only in cited-ref titles / host machinery, not OMM12 substrates) |
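
The fold step described above (take the Europe PMC full-text XML, extract the body, write it into the `PMID_<id>.md` cache) could look roughly like the sketch below. This is not the script used in this PR. The function names, the `## Full text` heading, and the cache layout are assumptions made for illustration, and the network fetch itself is left out, so the code only handles XML that has already been downloaded.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Europe PMC REST endpoint for open-access full text (expects a PMCID such as
# "PMC1234567"). Shown for orientation only; this sketch does not call it.
FULLTEXT_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML"


def body_text(jats_xml: str) -> str:
    """Return the article body as plain text, one paragraph per block."""
    root = ET.fromstring(jats_xml)
    body = root.find(".//body")
    if body is None:
        return ""
    paragraphs = []
    for p in body.iter("p"):
        # Flatten inline markup (<italic>, <sub>, ...) and collapse whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs)


def fold_into_cache(cache_dir: Path, pmid: str, jats_xml: str) -> Path:
    """Append the body text to PMID_<pmid>.md, creating the file if absent.

    The file name pattern follows the PR text; the section heading is a
    hypothetical choice.
    """
    path = cache_dir / f"PMID_{pmid}.md"
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    text = body_text(jats_xml)
    if text and text not in existing:  # skip if already folded in
        section = "## Full text\n\n" + text + "\n"
        if existing.strip():
            section = existing.rstrip("\n") + "\n\n" + section
        path.write_text(section, encoding="utf-8")
    return path
```

Keeping the extracted text in the same cache file the validator reads means a snippet quoted in a product file can later be checked against that one file. Note that this sketch would emit nested `<p>` elements twice, which a real script would need to handle.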

Full-text fetch outcomes

  • Recovered & backfilled: Kombucha, Lotus, OMM12.
  • Not in Europe PMC OA (404): Cheese_Rind, Maize_Root, Infant_Gut_Phage, hCom2 — full text not retrievable this way.
  • Fetched but genuinely no chemistry: Altered_Schaedler (confirmed un-backfillable).

Verification

  • 16/16 labels OAK-canonical; 16/16 snippets exact substrings of the enriched caches
  • linkml-validate all 3 → exit 0
  • just validate-products (blocking gate) → exit 0 (5461 OK_CANONICAL)
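
The "exact substring" check above can be illustrated with a few lines of Python. This is a minimal sketch of the idea, not the project's validator; `unverified_snippets` is a hypothetical helper name and the sample text is invented for the example.

```python
def unverified_snippets(cache_text: str, snippets: list[str]) -> list[str]:
    """Return the snippets that are NOT exact substrings of the cache text."""
    return [s for s in snippets if s not in cache_text]


# Invented sample text standing in for an enriched PMID_<id>.md cache.
cache = "The medium contained glucose, fructose and sucrose."
missing = unverified_snippets(cache, ["glucose, fructose", "glucose and fructose"])
print(missing)  # ['glucose and fructose']
```

An empty result corresponds to the "16/16 snippets" outcome reported here. The match is byte-for-byte, so any paraphrase or whitespace change in a snippet counts as a failure.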

Adoption: 207 → 209 / 265 (Lotus + OMM12 newly populated; Kombucha enriched).

🤖 Generated with Claude Code

… OMM12) (#30)

Applies the (b) re-fetch technique with FULL TEXT (not just abstracts): pulled
Europe PMC OA fullTextXML for stub-cache-limited / abstract-poor files, folded
the body into the canonical PMID_<id>.md caches, then backfilled. 14 new
CHEBI-grounded ingredients:

- Kombucha (PMID_26061774.md enriched): +4 (glucose, acetic acid, lactic acid,
  lactose) on top of the existing sucrose/cellulose
- Lotus_LjSC3 (PMID_34312531.md created from full text): 9 — the paper's defined
  artificial-root-exudate medium (glucose, fructose, sucrose, succinic acid,
  sodium lactate, citric acid, serine, alanine, glutamic acid)
- OMM12 (PMID_29051233.md + PMID_31998276.md enriched): 1 (complex carbohydrates;
  bile acid/mucin correctly rejected — only in cited-ref titles / host machinery)

Caches updated in-repo so snippets remain verifiable/reproducible (validator
reads .md primary). Not recoverable via Europe PMC OA (404 / not OA): Cheese_Rind,
Maize_Root, Infant_Gut_Phage, hCom2. Altered_Schaedler full text fetched but
genuinely names no metabolites — confirmed un-backfillable.

Verified: 16/16 labels canonical, 16/16 snippets exact substrings of the
enriched caches, all 3 pass linkml-validate, `just validate-products` exits 0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@realmarcin realmarcin merged commit b80763c into main Jun 16, 2026
3 checks passed
@realmarcin realmarcin deleted the backfill/related-ingredients-fulltext-refetch branch June 16, 2026 06:36